Dataset Description
A total of 7880 individuals from 2611 families were genotyped on the Illumina Human1M-Duov3_B or the Human1Mv1_C.
- 4901 males, 2979 females.
- 2571 trios, 36 quads, 1 five-member family (penta), and 3 six-member families (hex).
- 947,233 SNPs were genotyped.
- Coordinates were based on Build36.
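As a quick consistency check, the family counts can be tallied against the reported totals. The sketch below assumes that trio, quad, penta and hex mean 3, 4, 5 and 6 genotyped members per family; that reading is an assumption, not something the dataset description states.

```python
# Assumed family sizes: trio=3, quad=4, penta=5, hex=6 genotyped members.
family_counts = {3: 2571, 4: 36, 5: 1, 6: 3}

n_families = sum(family_counts.values())
n_individuals = sum(size * n for size, n in family_counts.items())
n_by_sex = 4901 + 2979  # males + females as reported

print(n_families, n_individuals, n_by_sex)
```

Under that assumption, the family and individual tallies match the reported 2611 families and 7880 individuals, and the sex counts sum to the same total.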
Raw Genotype QC
Sex Check
- 141 individuals flagged as PROBLEM
- 115 with completely missing chrX genotypes.
- 26 with chrX F ranging from 0.20 to 0.62.
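The split of flagged samples above can be reproduced from a sex-check table. The following is a minimal sketch, assuming the table has the column layout of a PLINK 1.9 `.sexcheck` file (FID, IID, PEDSEX, SNPSEX, STATUS, F) and that a sample with no usable chrX genotypes carries `nan` in the F column. The function name and the toy rows are hypothetical.

```python
def summarize_sexcheck(lines):
    """Split PROBLEM samples into those without and with a chrX F estimate."""
    header = lines[0].split()
    status_col, f_col = header.index("STATUS"), header.index("F")
    no_f, with_f = [], []
    for line in lines[1:]:
        fields = line.split()
        if not fields or fields[status_col] != "PROBLEM":
            continue
        sample = (fields[0], fields[1])
        if fields[f_col].lower() == "nan":
            # Assumed marker for samples with no chrX genotypes at all.
            no_f.append(sample)
        else:
            with_f.append((sample, float(fields[f_col])))
    return no_f, with_f


# Toy rows, invented for illustration only.
toy = [
    "FID IID PEDSEX SNPSEX STATUS F",
    "F1 A 1 1 OK 0.99",
    "F1 B 2 0 PROBLEM 0.41",
    "F2 C 2 0 PROBLEM nan",
]
no_f, with_f = summarize_sexcheck(toy)
print(len(no_f), len(with_f))
```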
ChrX F distributions

Pairwise IBD estimation
- Relationships (RT): OT (Others), FS (Full Siblings), PO (Parent Offspring)
- Family IDs 483 and 1012 have potential issues.
- FID:483 has IBD sharing = 1 between IID:328 (Female) and IID:1491 (Female), which was expected to be ~0. Are they MZ twins, or the same individual? The genotype missing rates are 0.1185 and 0.1186 for IID:483_328 and IID:483_1491, respectively. [Drop IID:483_1491, IID:483_4371 and IID:483_993.]
- FID:1012 has IBD sharing = 0.57 between IID:2319 (Female) and IID:3612 (Female). They are supposed to be FS but were recruited into the same FID. The other relationships within FID:1012 can be confirmed.
- IBD sharing for other pairs ranges from 0.44 to 0.58 in FS, from 0.50 to 0.59 in PO, and from 0 to 0.12 in OT (which indicates inbreeding between some parents).
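The comparison of estimated sharing against the recorded relationship can be sketched as a simple tolerance check. The expected values (0.5 for PO and FS, 0 for OT) are the standard expectations for the proportion of the genome shared IBD. The tolerance of 0.2 and the function name are hypothetical choices for illustration, not the thresholds used in this QC, and all rows other than the FID:483 pair are invented.

```python
# Expected proportion of the genome shared IBD for each relationship type.
EXPECTED_SHARING = {"PO": 0.5, "FS": 0.5, "OT": 0.0}


def flag_pairs(pairs, tolerance=0.2):
    """Return pairs whose estimated sharing deviates from expectation.

    pairs: iterable of (FID, IID1, IID2, RT, estimated_sharing).
    tolerance: hypothetical cutoff on the absolute deviation.
    """
    flagged = []
    for fid, iid1, iid2, rt, sharing in pairs:
        if abs(sharing - EXPECTED_SHARING[rt]) > tolerance:
            flagged.append((fid, iid1, iid2, rt, sharing))
    return flagged


pairs = [
    ("483", "328", "1491", "OT", 1.00),  # pair reported above, expected ~0
    ("T1", "a", "b", "FS", 0.44),        # invented
    ("T1", "a", "c", "PO", 0.59),        # invented
    ("T2", "d", "e", "OT", 0.12),        # invented
]
flagged = flag_pairs(pairs)
print(flagged)
```

With this tolerance, the ranges reported for FS, PO and OT pairs pass, and only the FID:483 pair is flagged.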
Estimated pairwise IBD distributions

Individual genome-wide heterozygosity
Genome-wide heterozygosity VS missing rates

Note that each sample was genotyped on either the Human1M-Duov3_B or the Human1Mv1_C. The merged genotype data are the union of the markers from both platforms.
- In the final merged dataset, 947,233 markers are represented. Of these, 938,130 are on the Human1M-Duov3_B and 858,052 are on the Human1Mv1_C.
- Because markers absent from a platform are missing for every individual genotyped on it, the missingness rate in the merged data is at least 1% for individuals on the Human1M-Duov3_B and at least 9% for individuals on the Human1Mv1_C.
- Each individual was checked to have a missingness rate of no more than 5% on the platform they were genotyped on.
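The platform-specific floors on missingness follow directly from the marker counts: any marker that is not on an individual's platform is missing for that individual in the merged data. A short worked calculation, using only the counts reported above:

```python
total_markers = 947_233
markers_on_platform = {
    "Human1M-Duov3_B": 938_130,
    "Human1Mv1_C": 858_052,
}

# Minimum missingness in the merged data = share of markers absent from the platform.
floor_missingness = {
    name: 1 - n / total_markers for name, n in markers_on_platform.items()
}

for name, rate in floor_missingness.items():
    print(f"{name}: {rate:.2%}")
```

The two floors come out just under 1% and a little above 9%, which is why a single merged-data missingness threshold treats the two platforms very differently.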
Genome-wide F VS missing rates

Imputation
Pre-imputation
The imputation pipeline follows that used for the SSC dataset. A total of 7769 individuals, ~784K autosomal SNPs and ~22K chrX SNPs were used for imputation.
- filters: --geno 0.05 --mind 0.2 --maf 0.01 --hwe 1e-6
- 111 people removed due to missing genotype data (--mind).
- Total genotyping rate in remaining samples is 0.914029.
- 124565 variants removed due to missing genotype data (--geno).
- 15633 variants removed due to Hardy-Weinberg exact test.
Note that a liberal threshold of 0.2 was used for the individual genotype missing rate (--mind) for the AGP data, since a large number of individuals have missingness rates > 0.1. The 111 people removed have missingness rates ranging from 0.7 to 1. As noted in Section 2.3.1, the samples were combined from two genotype arrays. Before merging the genotypes, the missingness rates were confirmed to be < 0.1.
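For reference, the filter settings listed above can be assembled into a single PLINK call. This is a sketch only: the four thresholds are those reported here, while the file prefixes `agp_raw` and `agp_preimpute`, the use of `--bfile` and `--make-bed`, and the helper name are assumptions. The command is built but not executed.

```python
def plink_filter_command(bfile, out, geno=0.05, mind=0.2, maf=0.01, hwe=1e-6):
    """Build (but do not run) a PLINK command with the reported QC thresholds."""
    return [
        "plink",
        "--bfile", bfile,      # hypothetical input prefix
        "--geno", str(geno),   # per-variant missingness threshold
        "--mind", str(mind),   # per-individual missingness threshold
        "--maf", str(maf),     # minor allele frequency threshold
        "--hwe", str(hwe),     # Hardy-Weinberg exact test p-value threshold
        "--make-bed",
        "--out", out,          # hypothetical output prefix
    ]


cmd = plink_filter_command("agp_raw", "agp_preimpute")
print(" ".join(cmd))

# Sample count after the --mind filter, from the figures reported above.
remaining = 7880 - 111
print(remaining)
```

The command list could be passed to `subprocess.run(cmd, check=True)` on a system where PLINK is installed.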
After Imputation
Frequency distribution
- ~7.6M SNPs overlapped between AGP_imputed and HRC_WGS (passing filters: --geno 0.05 --maf 0.01 --hwe 1e-6)
- Frequencies were compared based on the same allele
- 0 SNPs with MAF difference > 0.2
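A frequency comparison of this kind can be sketched as follows. The sketch assumes each dataset provides, per SNP, the counted allele, the other allele and the frequency of the counted allele. When the two datasets count opposite alleles, the second frequency is flipped so that both refer to the same allele. SNPs whose allele pairs do not match at all (for example strand differences) are skipped rather than resolved. The function name and the toy values are hypothetical.

```python
def discordant_snps(panel_a, panel_b, max_diff=0.2):
    """Return shared SNP IDs whose same-allele frequency differs by more than max_diff.

    panel_a, panel_b: {snp_id: (counted_allele, other_allele, frequency)}
    """
    flagged = []
    for snp in sorted(panel_a.keys() & panel_b.keys()):
        a1, a2, freq_a = panel_a[snp]
        b1, b2, freq_b = panel_b[snp]
        if (a1, a2) == (b2, b1):
            # Opposite allele counted in panel_b: express it for the same allele.
            freq_b = 1.0 - freq_b
        elif (a1, a2) != (b1, b2):
            # Allele pairs do not match; not handled in this sketch.
            continue
        if abs(freq_a - freq_b) > max_diff:
            flagged.append(snp)
    return flagged


# Toy frequencies, invented for illustration only.
imputed = {"rs1": ("A", "G", 0.30), "rs2": ("C", "T", 0.10), "rs3": ("G", "A", 0.45)}
reference = {"rs1": ("A", "G", 0.32), "rs2": ("T", "C", 0.88), "rs3": ("G", "A", 0.05)}

flagged = discordant_snps(imputed, reference)
print(flagged)
```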

PCA
- Project the first 3 PCs, based on pruned HapMap3 SNPs, onto 1000G.
- Use K-means to calculate distance.
- Assign ancestry based on a posterior probability of 0.9.
- 6548 Europeans (EUR), 625 Admixed Americans (AMR), 123 South Asians (SAS), 99 East Asians (EAS) and 141 Africans (AFR).
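The notes do not spell out how distances are turned into posterior probabilities, so the following is only an illustrative stand-in, not the method used here. It computes a centroid for each reference population in the space of the first 3 PCs, measures the squared Euclidean distance from a sample to each centroid, and converts the distances to pseudo-posteriors with a softmax. A sample is assigned only when its top probability reaches 0.9. The softmax, the `scale` parameter, the function names and the toy coordinates are all assumptions.

```python
import math


def centroids(ref_pcs):
    """ref_pcs: {population: [(pc1, pc2, pc3), ...]} -> {population: mean vector}."""
    return {
        pop: tuple(sum(coord) / len(points) for coord in zip(*points))
        for pop, points in ref_pcs.items()
    }


def assign_ancestry(sample, cents, scale=1.0, threshold=0.9):
    """Return (population or None, top pseudo-posterior) for one sample."""
    # Squared Euclidean distance from the sample to each population centroid.
    d2 = {
        pop: sum((s - c) ** 2 for s, c in zip(sample, cen))
        for pop, cen in cents.items()
    }
    # Softmax over negative distances (assumed stand-in for a posterior).
    d_min = min(d2.values())
    weights = {pop: math.exp(-(d - d_min) / scale) for pop, d in d2.items()}
    total = sum(weights.values())
    posterior = {pop: w / total for pop, w in weights.items()}
    best = max(posterior, key=posterior.get)
    label = best if posterior[best] >= threshold else None
    return label, posterior[best]


# Toy reference coordinates, invented for illustration only.
reference = {
    "EUR": [(0.0, 0.0, 0.0), (0.2, 0.0, 0.0)],
    "AFR": [(5.0, 5.0, 0.0), (5.2, 5.0, 0.0)],
}
cents = centroids(reference)

clear_label, _ = assign_ancestry((0.1, 0.1, 0.0), cents)
ambiguous_label, _ = assign_ancestry((2.6, 2.5, 0.0), cents)
print(clear_label, ambiguous_label)
```

Samples that fall between clusters stay unassigned under the threshold, which would be consistent with the five ancestry groups summing to fewer than the 7769 individuals carried into imputation.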
